
    A mesocosm experiment investigating the effects of substratum quality and wave exposure on the survival of fish eggs

    In a mesocosm experiment, the attachment of bream (Abramis brama) eggs to spawning substrata with and without periphytic biofilm coverage, and their subsequent survival with and without low-intensity wave exposure, were investigated. Egg attachment was reduced by 73% on spawning substrata with a natural periphytic biofilm compared to clean substrata. This initial difference in egg numbers persisted until hatching; it widened further in the wave treatment and narrowed in the no-wave control treatment. Exposure to a low-intensity wave regime affected egg development differently in the two biofilm treatments: waves enhanced egg survival on substrata without a biofilm but reduced the survival of eggs on biofilm-covered substrata. In the treatment combining biofilm-covered substrata and waves, no attached eggs survived until hatching. In all treatments, more than 75% of the eggs became detached from the spawning substrata during the egg incubation period.

    Multilabel Classification with R Package mlr

    We implemented several multilabel classification algorithms in the machine learning package mlr. The implemented methods are binary relevance, classifier chains, nested stacking, dependent binary relevance and stacking, all of which can be used with any base learner accessible in mlr. In addition, the multilabel classification versions of randomForestSRC and rFerns are available. All of these methods can easily be compared using the multilabel performance measures and resampling methods implemented in the standardized mlr framework. In a benchmark experiment with several multilabel datasets, the performance of the different methods is evaluated. Comment: 18 pages, 2 figures; to be published in the R Journal; references corrected.
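    The binary relevance strategy named above is simple to sketch outside of mlr as well: fit one independent binary classifier per label. The following Python example uses scikit-learn in place of mlr; the dataset and base learner are illustrative choices, not taken from the paper.

    ```python
    # Minimal binary-relevance sketch (scikit-learn stand-in for mlr's method).
    import numpy as np
    from sklearn.base import clone
    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic multilabel data: 200 samples, 4 labels.
    X, Y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)

    # Binary relevance: one independent binary model per label column.
    base = LogisticRegression(max_iter=1000)
    models = [clone(base).fit(X, Y[:, j]) for j in range(Y.shape[1])]

    # Predict by stacking the per-label predictions column-wise.
    pred = np.column_stack([m.predict(X) for m in models])
    print(pred.shape)  # (200, 4)
    ```

    Classifier chains differ only in that each model additionally receives the preceding labels as input features, which lets the chain exploit label dependencies.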

    Learning Multiple Defaults for Machine Learning Algorithms

    The performance of modern machine learning methods depends strongly on their hyperparameter configurations. One simple way of selecting a configuration is to use default settings, often proposed along with the publication and implementation of a new algorithm. These default values are usually chosen in an ad-hoc manner to work well enough on a wide variety of datasets. To address this problem, various automatic hyperparameter configuration algorithms have been proposed, which select an optimal configuration per dataset. This principled approach usually improves performance, but adds algorithmic complexity and computational cost to the training procedure. As an alternative, we propose learning a set of complementary default values from a large database of prior empirical results. Selecting an appropriate configuration on a new dataset then requires only a simple, efficient and embarrassingly parallel search over this set. We demonstrate the effectiveness and efficiency of our approach in comparison to random search and Bayesian optimization.
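    The search step described above can be sketched as follows. This is a hedged illustration in Python with scikit-learn: the candidate configurations are invented for the example, not the complementary defaults learned in the paper.

    ```python
    # Sketch of searching over a small fixed set of default configurations:
    # evaluate each candidate independently (embarrassingly parallel) and
    # keep the best. Configurations below are hypothetical placeholders.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # Hypothetical "complementary defaults", each covering a different regime.
    defaults = [
        {"n_estimators": 100, "max_features": "sqrt"},
        {"n_estimators": 300, "max_features": 0.5, "min_samples_leaf": 5},
        {"n_estimators": 100, "max_features": 1.0, "min_samples_leaf": 10},
    ]

    # Each evaluation is independent, so this loop parallelizes trivially.
    scores = [
        cross_val_score(RandomForestClassifier(random_state=0, **cfg), X, y, cv=3).mean()
        for cfg in defaults
    ]
    best = defaults[scores.index(max(scores))]
    print(best)
    ```

    Because the set is small and fixed, this replaces a full tuning run with a handful of model fits.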

    Hyperparameters, tuning and meta-learning for random forest and other machine learning algorithms

    In this cumulative dissertation, I examine the influence of hyperparameters on machine learning algorithms, with a special focus on random forest. It mainly consists of three papers written in the last three years.
    The first paper (Probst and Boulesteix, 2018) examines the influence of the number of trees on the performance of a random forest. It is generally believed that the number of trees should be set high to achieve better performance. However, we show some real data examples in which the expected values of measures such as accuracy and AUC (partially) decrease with a growing number of trees. We prove theoretically why this can happen and argue that it only happens in very special data situations. For other measures, such as the Brier score, the logarithmic loss or the mean squared error, we show that this cannot happen. In a benchmark study based on 306 classification and regression datasets, we illustrate the extent of this unexpected behaviour. We observe that, on average, most of the performance improvement is achieved while growing the first 100 trees. We use our new OOBCurve R package (Probst, 2017a) for the analysis, which examines the performance of a random forest for a growing number of trees based on the out-of-bag observations.
    The second paper (Probst et al., 2019b) is a more general work. First, we review the literature on the influence of hyperparameters on random forest. The hyperparameters considered are the number of variables drawn at each split, the sampling scheme for drawing observations for each tree, the minimum number of observations in a node that a tree is allowed to have, the number of trees and the splitting rule. Their influence is examined with regard to performance, runtime and variable importance. In the second part of the paper, different tuning strategies for obtaining optimal hyperparameters are presented, and a new R package, tuneRanger, is introduced. It executes the tuning strategy of sequential model-based optimization based on the out-of-bag observations; the hyperparameters and ranges for tuning are chosen automatically. In a benchmark study, this implementation is compared with other implementations that execute tuning for random forest.
    The third paper (Probst et al., 2019a) is even more general and presents a framework for examining the tunability of the hyperparameters of machine learning algorithms. It first defines the concept of defaults properly and proposes definitions for measuring the tunability of the whole algorithm, of single hyperparameters and of combinations of hyperparameters. To apply these definitions to a collection of 38 binary classification datasets, a random bot was created, which generated in total around 5 million experiment runs of 6 algorithms with different hyperparameters. The details of this bot are described in an extra paper (Kühn et al., 2018), co-authored by myself, that is also included in this dissertation. The results of this bot are used to estimate the tunability of these 6 algorithms and their specific hyperparameters. Furthermore, ranges for the parameter tuning of these algorithms are proposed.
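    The OOB-curve idea from the first paper can be illustrated in Python with scikit-learn (as a stand-in for the OOBCurve R package): grow a forest incrementally with warm_start and record the out-of-bag error after each increment. The dataset and tree counts are illustrative choices.

    ```python
    # Track out-of-bag error as trees are added to a random forest.
    # warm_start=True keeps the already-fitted trees and only adds new ones.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)

    clf = RandomForestClassifier(n_estimators=25, oob_score=True,
                                 warm_start=True, random_state=0)
    oob_errors = []
    for n in (25, 50, 100, 200):
        clf.set_params(n_estimators=n)
        clf.fit(X, y)                        # adds trees up to n, reuses old ones
        oob_errors.append(1.0 - clf.oob_score_)

    print(oob_errors)
    ```

    Plotting such a curve typically shows the error flattening out early, which is the behaviour the benchmark study quantifies.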

    The times they are a-changin'

    Technological innovation is promoted as one way to cope with increasing energy consumption in the household sector. The behavioural and environmental consequences caused by the introduction of new technological devices into the household are, however, unclear at best. Starting from a discussion of consumer behaviour and rebound effects, the purpose of this study was to analyze changes in households' energy consumption by adopting a temporal, activity-based method. Time-use patterns are seen as a way to describe behavioural patterns, opening up the possibility of modelling the changes that follow the adoption of a new technology as changing time use. The study analyzed the impact of the personal computer (PC) on UK households in the period 1999 to 2001. By combining environmental data with statistics on time use, it was possible to model short-term changes in time-use patterns, comparing a group of PC adopters with a control group not adopting one. This allowed for an analysis of substitution effects between different household activities as well as of the consequences for energy consumption, focusing on the possible influences triggered by the new technology. The results indicate that the adoption of a personal computer has a beneficial environmental effect, as activities of low and medium energy intensity substitute for high-intensity activities, resulting in decreasing energy demand. However, the results are not generally valid, as further analysis distinguishing between different subgroups (age, gender, and household size) suggests different trends.
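    The activity-based accounting behind this comparison can be illustrated with a toy calculation: weight the time spent on each activity by an energy intensity and compare group totals. All intensities and time allocations below are invented for illustration; they are not figures from the thesis.

    ```python
    # Toy activity-based energy accounting (all numbers hypothetical).
    # Energy intensity per activity, in MJ per hour (invented values).
    intensity = {"tv": 3.0, "computer": 1.5, "cooking": 5.0, "reading": 0.5}

    def daily_energy(time_use):
        """Sum of hours spent per activity, weighted by its intensity."""
        return sum(intensity[a] * h for a, h in time_use.items())

    # Hypothetical average time use (hours/day) for the two groups.
    adopter = {"tv": 1.0, "computer": 2.0, "cooking": 1.0, "reading": 0.5}
    control = {"tv": 2.5, "computer": 0.0, "cooking": 1.5, "reading": 0.5}

    print(daily_energy(adopter), daily_energy(control))  # 11.25 15.25
    ```

    In this toy setup the adopter group substitutes low-intensity computer time for high-intensity activities, so its daily total comes out lower, mirroring the substitution effect the study describes.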

    Random forest versus logistic regression: A large-scale benchmark experiment

    BACKGROUND AND GOAL: The random forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. It has meanwhile grown into a standard classification approach competing with logistic regression (LR) in many innovation-friendly scientific fields. RESULTS: In this context, we present a large-scale benchmarking experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of bias. CONCLUSION: RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95% CI [0.022, 0.038]) for accuracy, 0.041 (95% CI [0.031, 0.053]) for the area under the curve, and -0.027 (95% CI [-0.034, -0.021]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side result of our benchmarking experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, emphasizing the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, based on a high number of datasets and carefully designed, will be necessary in the future to evaluate further variants, implementations or parameters of random forests which may yield improved accuracy compared to the original version with default values.
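    The core loop of such a benchmark can be sketched as follows; this is a hedged Python/scikit-learn illustration with a single stand-in dataset, not the 243 datasets, implementations or evaluation protocol of the study.

    ```python
    # Sketch of a per-dataset RF-vs-LR comparison: fit both with default
    # settings, cross-validate, and record the accuracy difference.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Illustrative stand-in for the benchmark's dataset collection.
    datasets = {"breast_cancer": load_breast_cancer(return_X_y=True)}

    diffs = []
    for name, (X, y) in datasets.items():
        rf = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
        # Scaling added so default LR converges; a benchmark would fix this per protocol.
        lr_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        lr = cross_val_score(lr_model, X, y, cv=5).mean()
        diffs.append(rf - lr)
        print(f"{name}: RF - LR accuracy = {rf - lr:+.3f}")
    ```

    Aggregating such per-dataset differences (and their confidence intervals) across many datasets is what yields summary figures like the 0.029 mean accuracy difference reported above.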